Report: YIEDL Experiment 2 (YIEDL-Numerai Dataset)

Author

Joe (Degenius Maximus) Chow

Published

April 29, 2025

Version: 0.1


1 Introduction

For the YIEDL competition we distribute a daily dataset with week-to-week targets, which has fewer features than the new dataset provided to Numerai for their crypto competition. We should compare the performance of the two datasets under a variety of models and check whether it makes sense to push the adoption of the new dataset in our competitions.

Experiment two is, in many ways, similar to experiment one. The main focus here is the Full Daily Data (i.e. the dataset for Numerai Crypto with 3000+ features). In addition to making out-of-bag predictions on weekly test data only, this experiment also covers predictions on daily test data.

1.1 Results Overview (Long Story Short)

If you’re swamped with deadlines and your coffee’s ice-cold, here’s the straight-to-the-point rundown of the results and conclusions to keep you informed without the fuss!

  • Daily models do not improve out-of-bag predictive performance on weekly neutral targets, indicating the Full Daily Data might not be suitable for the classic YIEDL Neutral competition in the current weekly tournament format. ❌

  • Daily models do show better out-of-bag predictive performance on daily neutral targets, indicating the Full Daily Data is suitable for the daily submission (i.e. the current Numerai Crypto format). ✔️

  • Daily models do show better out-of-bag predictive performance on weekly updown targets, indicating the Full Daily Data is suitable for the classic YIEDL Updown competition in the current weekly tournament format. ✔️

  • Daily models do show better out-of-bag predictive performance on daily updown targets. Yet, this might not be helpful for Numerai Crypto as their targets are similar to the normalised neutral targets. 🤷

See conclusions below for more details.


2 Experiment Set-up

2.1 Datasets

The following three datasets from https://yiedl.ai/competition/datasets were used:

  1. Targets from YIEDL Weekly Data - dataset_weekly_2025_15.zip
  2. Targets from YIEDL Daily Data - dataset_daily_2025_15.zip
  3. Features from Full (aka YIEDL-Numerai) Daily Data - dataset_historical_20250401.zip

2.2 Training vs. Test Periods

  1. Training: 2018-04-27 to 2022-10-31
  2. Embargo: 2022-11-01 to 2022-12-31 (a two-month gap between training and test to avoid data leakage)
  3. Test: 2023-01-01 to 2025-03-31
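The date boundaries above can be sketched as a small helper. This is a hypothetical illustration (the actual pipeline's split logic is not shown in this report); the embargo rows are discarded entirely to avoid leakage across the train/test boundary.

```python
from datetime import date

# Hypothetical helper mirroring the periods above; "embargo" rows are
# dropped before training/evaluation to avoid data leakage.
def assign_split(d: date) -> str:
    if date(2018, 4, 27) <= d <= date(2022, 10, 31):
        return "train"
    if date(2022, 11, 1) <= d <= date(2022, 12, 31):
        return "embargo"
    if date(2023, 1, 1) <= d <= date(2025, 3, 31):
        return "test"
    return "unused"
```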

2.3 Stats

  1. Training (Weekly) = 94625 samples from 2018-04-29 to 2022-10-30.
  2. Training (Daily) = 660847 samples from 2018-04-27 to 2022-10-31.
  3. Test (Weekly) = 142039 samples from 2023-01-01 to 2025-03-30.
  4. Test (Daily) = 988574 samples from 2023-01-01 to 2025-03-31.

2.4 Features and Targets

There are 3669 features and 2 targets (target_neutral and target_updown) in the datasets. Here is an example of the weekly training data:

             date symbol pvm_0001 sentiment_0001 onchain_0001 target_neutral
           <Date> <char>    <int>          <int>        <int>          <num>
    1: 2018-04-29    ADA       96              1            1      0.4102564
    2: 2018-04-29   AGIX       28             29            2      0.4358974
    3: 2018-04-29    BAT       48             12           51      0.3333333
    4: 2018-04-29    BCH       33             30          100      0.9743590
    5: 2018-04-29    BTC       36              0            0      0.7435897
   ---                                                                      
94621: 2022-10-30   ZORA       91             50           85      0.9228571
94622: 2022-10-30    ZRX       62             50            5      0.6071429
94623: 2022-10-30    ZYN        4             50           51      0.8485714
94624: 2022-10-30  eRSDL       91             50           89      0.1614286
94625: 2022-10-30  stETH       81             50           51      0.3814286
       target_updown
               <num>
    1:  -0.041188088
    2:  -0.040989237
    3:  -0.059565759
    4:   0.225918211
    5:   0.025025799
   ---              
94621:   0.171695311
94622:   0.019647720
94623:   0.092574865
94624:  -0.062505238
94625:  -0.009704795
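The training table above is effectively a join of the Full Daily features onto the targets by (date, symbol). Here is a minimal pure-Python sketch of that join using two rows from the sample; the dict-of-dicts layout is purely illustrative, not the actual pipeline.

```python
# Illustrative rows taken from the sample above, keyed by (date, symbol).
features = {
    ("2018-04-29", "ADA"): {"pvm_0001": 96, "sentiment_0001": 1, "onchain_0001": 1},
    ("2018-04-29", "BTC"): {"pvm_0001": 36, "sentiment_0001": 0, "onchain_0001": 0},
}
targets = {
    ("2018-04-29", "ADA"): {"target_neutral": 0.4102564},
    ("2018-04-29", "BTC"): {"target_neutral": 0.7435897},
}
# inner join on the shared (date, symbol) key
train = {k: {**features[k], **targets[k]} for k in features.keys() & targets.keys()}
```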

2.5 Models

The models can be categorised into eight groups:

  1. 300 models trained with weekly data + target_neutral —> predict on weekly data
  2. 300 models trained with daily data + target_neutral —> predict on weekly data
  3. 300 models trained with weekly data + target_neutral —> predict on daily data
  4. 300 models trained with daily data + target_neutral —> predict on daily data
  5. 300 models trained with weekly data + target_updown —> predict on weekly data
  6. 300 models trained with daily data + target_updown —> predict on weekly data
  7. 300 models trained with weekly data + target_updown —> predict on daily data
  8. 300 models trained with daily data + target_updown —> predict on daily data

Note: each model is a simple average ensemble from three runs with three different random seeds.

Note 2: since the complexity of the experiment increased significantly due to the larger number of features (from 1140 to 3669) and model groups (from 4 to 8), I had to reduce the size of the grid search (from 1008 to 300) in order to complete a reasonable number of runs within a few days.
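The grid-search-plus-seed-ensemble structure can be sketched as follows. The parameter names mirror the result-table notes (depth, rsamp, csamp); the grid values and the `fit_predict` stand-in are assumptions for illustration, not the actual xgboost training code.

```python
import itertools
import numpy as np

# Illustrative parameter grid; names mirror the result-table notes
# (max_depth, subsample, colsample_bytree) but the values are made up.
grid = list(itertools.product(
    [4, 6, 8],          # depth (max_depth)
    [0.6, 0.8, 1.0],    # rsamp (subsample)
    [0.6, 0.8, 1.0],    # csamp (colsample_bytree)
))

def fit_predict(params, seed, n_test=5):
    # stand-in for training one xgboost model with `params` and `seed`
    rng = np.random.default_rng(seed)
    return rng.random(n_test)

params = grid[0]
# each model = simple average ensemble over three seeded runs
ensemble = np.mean([fit_predict(params, s) for s in (1, 2, 3)], axis=0)
```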


3 Predictions

Here is an example of predictions from models trained with the neutral targets:

         date symbol yhat_weekly yhat_daily
       <Date> <char>       <num>      <num>
1: 2023-01-01  0xBTC  0.14035088 0.11336032
2: 2023-01-01   1ECO  0.60863698 0.68690958
3: 2023-01-01  1INCH  0.83265857 0.79082321
4: 2023-01-01    1WO  0.92712551 0.95411606
5: 2023-01-01    AAC  0.03643725 0.05668016
6: 2023-01-01   AAVE  0.76518219 0.85425101

Similarly, we can look at the predictions from models trained with the updown targets:

         date symbol yhat_weekly  yhat_daily
       <Date> <char>       <num>       <num>
1: 2023-01-01  0xBTC -0.04275151 -0.01003887
2: 2023-01-01   1ECO  0.05094402  0.09078478
3: 2023-01-01  1INCH -0.00897382 -0.00142727
4: 2023-01-01    1WO -0.03885654 -0.01133472
5: 2023-01-01    AAC -0.00622269  0.03552798
6: 2023-01-01   AAVE -0.00843999 -0.00597358


4 Evaluation Metrics

I am skipping the full explanations here as the metrics are exactly the same as the ones used in experiment one. Please see report one for more details.

  • Primary metrics: Spearman correlation, RMSE
  • Secondary metrics: Sharpe ratio, max drawdown, compound return, and trimmed mean.
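As a rough illustration of the primary metrics (the full definitions live in report one), here is a minimal sketch of Spearman correlation (no-ties case), RMSE, and a trimmed mean; the 10% trim fraction is an assumption.

```python
import numpy as np

def spearman(a, b):
    # Spearman = Pearson correlation of the ranks (no-ties case)
    ra, rb = a.argsort().argsort(), b.argsort().argsort()
    return np.corrcoef(ra, rb)[0, 1]

def rmse(a, b):
    # root mean squared error
    return float(np.sqrt(np.mean((a - b) ** 2)))

def trimmed_mean(x, trim=0.1):
    # drop the lowest/highest `trim` fraction before averaging
    lo, hi = int(len(x) * trim), int(len(x) * (1 - trim))
    return float(np.mean(np.sort(x)[lo:hi]))
```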


5 Report Structure for Evaluation Results

Note: in experiment two, we also look at the out-of-bag performance on daily test data (from 2023-01-01). Therefore, there are two evaluation groups in this report:

  1. X-to-Weekly means both weekly and daily models are evaluated using the same weekly test data.
  2. X-to-Daily means both weekly and daily models are evaluated using the same daily test data.

In order to simplify things, the results for each evaluation metric are presented in the following structure:

  • Metric
    • Group One (X-to-Weekly)
    • Group Two (X-to-Daily)

You can use the table of contents to go through the following sections:

  • Mean Spearman Correlation (Groups One and Two)
  • Sharpe Ratio (Groups One and Two)
  • Max Drawdown (Groups One and Two) (omitted for now as some models have zero drawdown - not useful for comparison)
  • Compound Return (Groups One and Two)
  • Trimmed Mean RMSE (Groups One and Two)

The hypotheses / expectations are the same as those in experiment one, so I am skipping them in this report.


6 Mean Spearman Correlation (Target Neutral)

6.1 Group One (X-to-Weekly)

6.1.1 Observations (Stats)

  1. No. of daily models with higher mean correlation = 15 out of 300 (5%) ❌

  2. Range of weekly-to-weekly models’ mean correlation (cor_w_w):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1617  0.1699  0.1727  0.1718  0.1740  0.1758 

  3. Range of daily-to-weekly models’ mean correlation (cor_d_w):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1601  0.1678  0.1700  0.1692  0.1715  0.1737 

  4. Range of raw performance differences (cor_d_w - cor_w_w):

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-0.006481 -0.003803 -0.002387 -0.002625 -0.001400  0.001333 

  5. Range of percentage differences (%) (diff / cor_w_w):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-3.7811 -2.1805 -1.3763 -1.5247 -0.8179  0.7963 

6.1.2 Observations (Charts)


6.1.3 Result Table

Notes:

  1. depth = max_depth
  2. rsamp = subsample
  3. csamp = colsample_bytree
  4. round = round
  5. cor_w_w= mean correlation of weekly-to-weekly models’ predictions
  6. cor_d_w= mean correlation of daily-to-weekly models’ predictions
  7. diff = cor_d_w - cor_w_w (i.e. positive differences mean the daily models are better)
  8. p_diff= diff / cor_w_w * 100 percentage difference (%)
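The diff and p_diff columns defined in the notes above reduce to a two-line calculation. A sketch, using the median correlations from the stats section as illustrative inputs:

```python
# Median values from the stats above, used only for illustration.
cor_w_w = 0.1727  # weekly-to-weekly mean correlation
cor_d_w = 0.1700  # daily-to-weekly mean correlation

diff = cor_d_w - cor_w_w       # positive => the daily model is better
p_diff = diff / cor_w_w * 100  # percentage difference (%)
```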

6.2 Group Two (X-to-Daily)

6.2.1 Observations (Stats)

  1. No. of daily models with higher mean correlation = 217 out of 300 (72.3%) ✔️

  2. Range of weekly-to-daily models’ mean correlation (cor_w_d):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1496  0.1569  0.1590  0.1584  0.1603  0.1622 

  3. Range of daily-to-daily models’ mean correlation (cor_d_d):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1501  0.1577  0.1598  0.1592  0.1615  0.1637 

  4. Range of raw performance differences (cor_d_d - cor_w_d):

      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-0.0022650 -0.0001138  0.0008720  0.0008062  0.0016975  0.0036830 

  5. Range of percentage differences (%) (diff / cor_w_d):

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-1.47744 -0.07244  0.54741  0.50665  1.07789  2.39334 

6.2.2 Observations (Charts)


6.2.3 Result Table

Notes:

  1. depth = max_depth
  2. rsamp = subsample
  3. csamp = colsample_bytree
  4. round = round
  5. cor_w_d= mean correlation of weekly-to-daily models’ predictions
  6. cor_d_d= mean correlation of daily-to-daily models’ predictions
  7. diff = cor_d_d - cor_w_d (i.e. positive differences mean the daily models are better)
  8. p_diff= diff / cor_w_d * 100 percentage difference (%)


7 Sharpe Ratio (Target Neutral)

7.1 Group One (X-to-Weekly)

7.1.1 Observations (Stats)

  1. No. of daily models with higher Sharpe ratio = 185 out of 300 (61.7%) ✔️

  2. Range of weekly-to-weekly models’ Sharpe ratio (shp_w_w):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.968   2.234   2.292   2.269   2.331   2.382 

  3. Range of daily-to-weekly models’ Sharpe ratio (shp_d_w):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.900   2.237   2.325   2.289   2.372   2.424 

  4. Range of raw performance differences (shp_d_w - shp_w_w):

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.10498 -0.02078  0.02104  0.01942  0.05881  0.16253 

  5. Range of percentage differences (%) (diff / shp_w_w):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-4.8264 -0.8937  0.9031  0.8223  2.5508  7.3730 
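The Sharpe ratios above can be illustrated with a minimal sketch. The exact definition used in this report lives in report one; a zero risk-free rate and weekly (52-period) annualisation are assumptions here.

```python
import numpy as np

def sharpe(returns, periods_per_year=52):
    # mean over standard deviation of per-period returns, annualised;
    # zero risk-free rate assumed
    r = np.asarray(returns, dtype=float)
    return float(r.mean() / r.std(ddof=1) * np.sqrt(periods_per_year))
```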

7.1.2 Observations (Charts)


7.1.3 Result Table

Notes:

  1. depth = max_depth
  2. rsamp = subsample
  3. csamp = colsample_bytree
  4. round = round
  5. shp_w_w= Sharpe ratio of weekly-to-weekly models’ predictions
  6. shp_d_w= Sharpe ratio of daily-to-weekly models’ predictions
  7. diff = shp_d_w - shp_w_w (i.e. positive differences mean the daily models are better)
  8. p_diff= diff / shp_w_w * 100 percentage difference (%)

7.2 Group Two (X-to-Daily)

7.2.1 Observations (Stats)

  1. No. of daily models with higher Sharpe ratio = 186 out of 300 (62%) ✔️

  2. Range of weekly-to-daily models’ Sharpe ratio (shp_w_d):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.882   2.177   2.244   2.214   2.274   2.316 

  3. Range of daily-to-daily models’ Sharpe ratio (shp_d_d):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.860   2.150   2.260   2.219   2.305   2.357 

  4. Range of raw performance differences (shp_d_d - shp_w_d):

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-0.147201 -0.017799  0.012998  0.005798  0.036700  0.099705 

  5. Range of percentage differences (%) (diff / shp_w_d):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-6.9324 -0.8128  0.5821  0.2261  1.6359  4.5313 

7.2.2 Observations (Charts)


7.2.3 Result Table

Notes:

  1. depth = max_depth
  2. rsamp = subsample
  3. csamp = colsample_bytree
  4. round = round
  5. shp_w_d= Sharpe ratio of weekly-to-daily models’ predictions
  6. shp_d_d= Sharpe ratio of daily-to-daily models’ predictions
  7. diff = shp_d_d - shp_w_d (i.e. positive differences mean the daily models are better)
  8. p_diff= diff / shp_w_d * 100 percentage difference (%)


8 Compound Return (Target Neutral)

8.1 Group One (X-to-Weekly)

8.1.1 Observations (Stats)

  1. No. of daily models with higher compound return = 16 out of 300 (5.3%) ❌

  2. Range of weekly-to-weekly models’ compound return (return_w_w):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  561.3   627.7   651.4   644.7   663.6   679.4 

  3. Range of daily-to-weekly models’ compound return (return_d_w):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  549.5   610.2   629.0   622.5   641.8   660.3 

  4. Range of raw performance differences (return_d_w - return_w_w):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -55.05  -32.99  -20.41  -22.20  -11.70   11.04 

  5. Range of percentage differences (%) (diff / return_w_w):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -8.320  -4.960  -3.121  -3.426  -1.840   1.819 
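The compound return figures can be illustrated with a generic compounding sketch. How the report aggregates per-period returns into the tabled figures is defined in report one; this is only the textbook formula.

```python
import numpy as np

def compound_return(returns):
    # cumulative return from compounding per-period returns
    return float(np.prod(1.0 + np.asarray(returns, dtype=float)) - 1.0)
```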

8.1.2 Observations (Charts)


8.1.3 Result Table

Notes:

  1. depth = max_depth
  2. rsamp = subsample
  3. csamp = colsample_bytree
  4. round = round
  5. return_w_w= Compound return of weekly-to-weekly models’ predictions
  6. return_d_w= Compound return of daily-to-weekly models’ predictions
  7. diff = return_d_w - return_w_w (i.e. positive differences mean the daily models are better)
  8. p_diff= diff / return_w_w * 100 percentage difference (%)

8.2 Group Two (X-to-Daily)

8.2.1 Observations (Stats)

  1. No. of daily models with higher compound return = 217 out of 300 (72.3%) ✔️

  2. Range of weekly-to-daily models’ compound return (return_w_d):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  476.5   528.3   543.7   539.4   554.0   568.2 

  3. Range of daily-to-daily models’ compound return (return_d_d):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  480.3   534.1   550.0   545.6   563.2   580.1 

  4. Range of raw performance differences (return_d_d - return_w_d):

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-15.7902  -0.8395   6.6565   6.1861  13.0039  26.7853 

  5. Range of percentage differences (%) (diff / return_w_d):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-3.1433 -0.1595  1.2135  1.1352  2.3816  5.2715 

8.2.2 Observations (Charts)


8.2.3 Result Table

Notes:

  1. depth = max_depth
  2. rsamp = subsample
  3. csamp = colsample_bytree
  4. round = round
  5. return_w_d= Compound return of weekly-to-daily models’ predictions
  6. return_d_d= Compound return of daily-to-daily models’ predictions
  7. diff = return_d_d - return_w_d (i.e. positive differences mean the daily models are better)
  8. p_diff= diff / return_w_d * 100 percentage difference (%)


9 Trimmed Mean RMSE (Target Updown)

9.1 Group One (X-to-Weekly)

9.1.1 Observations (Stats)

  1. No. of daily models with lower trimmed RMSE = 300 out of 300 (100%) ✔️

  2. Range of weekly-to-weekly models’ trimmed mean RMSE (rmse_w_w):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.5199  0.6167  0.7073  0.7670  0.8039  2.4007 

  3. Range of daily-to-weekly models’ trimmed mean RMSE (rmse_d_w):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.4636  0.5359  0.5824  0.5854  0.6266  0.7796 

  4. Range of raw performance differences (rmse_d_w - rmse_w_w):

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-1.79362 -0.17714 -0.10102 -0.18162 -0.06290 -0.01259 

  5. Range of percentage differences (%) (diff / rmse_w_w):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-75.292 -24.176 -14.416 -19.009 -10.378  -1.656 

9.1.2 Observations (Charts)


9.1.3 Result Table

Notes:

  1. depth = max_depth
  2. rsamp = subsample
  3. csamp = colsample_bytree
  4. round = round
  5. rmse_w_w= Trimmed mean RMSE of weekly-to-weekly models’ predictions
  6. rmse_d_w= Trimmed mean RMSE of daily-to-weekly models’ predictions
  7. diff = rmse_d_w - rmse_w_w (i.e. negative differences mean the daily models are better)
  8. p_diff= diff / rmse_w_w * 100 percentage difference (%)

9.2 Group Two (X-to-Daily)

9.2.1 Observations (Stats)

  1. No. of daily models with lower trimmed RMSE = 299 out of 300 (99.7%) ✔️

  2. Range of weekly-to-daily models’ trimmed mean RMSE (rmse_w_d):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.5275  0.6303  0.7185  0.7799  0.8204  2.4186 

  3. Range of daily-to-daily models’ trimmed mean RMSE (rmse_d_d):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.4668  0.5515  0.5976  0.6023  0.6478  0.8049 

  4. Range of raw performance differences (rmse_d_d - rmse_w_d):

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-1.80006 -0.18232 -0.09683 -0.17763 -0.06048  0.01493 

  5. Range of percentage differences (%) (diff / rmse_w_d):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -74.94  -23.41  -13.56  -18.19   -9.10    2.00 

9.2.2 Observations (Charts)


9.2.3 Result Table

Notes:

  1. depth = max_depth
  2. rsamp = subsample
  3. csamp = colsample_bytree
  4. round = round
  5. rmse_w_d= Trimmed mean RMSE of weekly-to-daily models’ predictions
  6. rmse_d_d= Trimmed mean RMSE of daily-to-daily models’ predictions
  7. diff = rmse_d_d - rmse_w_d (i.e. negative differences mean the daily models are better)
  8. p_diff= diff / rmse_w_d * 100 percentage difference (%)



10 Conclusions

  • A grid search (300 combinations of different xgboost parameters) was used for this experiment.

  • Pairs of weekly and daily models (trained using the same parameters) were used to produce out-of-bag predictions on the same weekly test data as well as the same daily test data from 2023-01-01.

  • Daily models do not improve out-of-bag predictive performance on weekly neutral targets, indicating the Full Daily Data might not be suitable for the classic YIEDL Neutral competition in the current weekly tournament format. ❌

  • Daily models do show better out-of-bag predictive performance on daily neutral targets, indicating the Full Daily Data is suitable for the daily submission (i.e. the current Numerai Crypto format). ✔️

  • Daily models do show better out-of-bag predictive performance on weekly updown targets, indicating the Full Daily Data is suitable for the classic YIEDL Updown competition in the current weekly tournament format. ✔️

  • Daily models do show better out-of-bag predictive performance on daily updown targets. Yet, this might not be helpful for Numerai Crypto as their targets are similar to the normalised neutral targets. 🤷

10.1 Summary (X-to-Weekly, Target Neutral)

  • Only a few (5%) daily models show HIGHER mean Spearman correlation compared to weekly models. ❌
  • Most (62%) daily models show HIGHER (average 0.8% increase in) Sharpe ratio compared to weekly models, indicating better performance. ✔️
  • Only a few (5%) daily models show HIGHER compound return compared to weekly models. ❌

10.2 Summary (X-to-Daily, Target Neutral)

  • Most (72%) daily models show HIGHER (average 0.5% increase in) mean Spearman correlation compared to weekly models, indicating better performance. ✔️
  • Most (62%) daily models show HIGHER (average 0.2% increase in) Sharpe ratio compared to weekly models, indicating better performance. ✔️
  • Most (72%) daily models show HIGHER (average 1.1% increase in) compound return compared to weekly models, indicating better performance. ✔️

10.3 Summary (X-to-Weekly, Target Updown)

  • All (100%) daily models show LOWER (average 19% decrease in) trimmed mean RMSE compared to weekly models, indicating better performance. ✔️

10.4 Summary (X-to-Daily, Target Updown)

  • Most (99%) daily models show LOWER (average 18% decrease in) trimmed mean RMSE compared to weekly models, indicating better performance. ✔️